ABSTRACT

The development and validation of oral native 
language tests for Dutch adolescents was conducted for both 
assessment and curriculum development purposes at the secondary 
level. The oral language situations tested were the monologue, 
dialogue, and polylogue or group conversation on a topic. An American 
inventory of common speech subjects was used to construct test 
situations, and the test subskills were drawn from the students' 
textbooks. The situations, subjects, and subskills were combined into 
three tests, which were administered to 14-year-olds in all areas of 
the country and all dialect regions. The tests were taped, and 
teachers were asked to rate the dialogue and monologue tests on 
content, organization, language, delivery, and communication. The 
interrater reliability and homogeneity of assessment criteria were 
analyzed statistically. On the whole, it was found that the two test 
types were closely correlated and that overall scores would have 
sufficed for ranking students in the case of almost all raters. 
However, when only one rater is available, as in a classroom testing 
situation, the analytic method of scoring is preferred to the overall 
scoring method. (MSE) 
1 Introduction

For oral tests to be used in the course of the education process
and meant to support it, that is, for formative oral tests, as a rule
only direct and not indirect testing methods will be adopted. That is
to say, the pupils taking the test will actually have to speak.
And, indeed, I think that indirect measures of the oral skill at an
intermediate level should be avoided, not only or in the first place
for reasons of validity but with a view to
- the motivation of pupils, as well as
- the effects such tests have on education, whether intended
  or not.

Generally speaking, test constructors prefer their tests to
elicit behaviour that does not differ too much from the criterion
behaviour, the sort of behaviour that would occur in ordinary life
and that the test is, in fact, meant to [...]. In everyday
reality there are, generally speaking, three situations calling
for the demonstration of the oral skill: the monologue, the dialogue
and the polylogue. In the large majority of situations falling into
any of these three main categories, each speaker can, to a large
extent, determine the direction of the activity in which he is
engaged. This is self-evident in the case of the monologue; in the
other cases the influence exerted by the participants will not exactly
be alike, but however that may be, it cannot be predicted with any
certainty what each speaker is going to say. Even situations
that limit a speaker's free scope severely, such as interviews, e.g.
between a senior and a junior staff member, remain
open to a large extent: the junior staff member can cause the
event in which he participates to take an unexpected turn. If it is
thought important that the behaviour elicited by the test and the
criterion behaviour do not diverge too much, the test should create
a situation which leaves the testee the scope he would have in
everyday reality.

Apart from this, 'realistic' tests will give rise to fewer
methodological problems in validating tests, that is, provided the
test's reliability meets the proper requirements. I take for
granted that a test which is reliable and prompts behaviour that is
quite close to criterion behaviour can be considered a valid test.
Only if the behaviour required by a test deviates more markedly
from the behaviour that is the real object of the evaluation is
further validation called for.

Of course I should make mention at this point of a difference
between testing the oral skills of L1 and L2 learners. L1 learners
can be assumed to avail themselves of a certain latitude in giving
direction to a conversation. But L2 learners at an elementary level
will not regret the constraints put upon their freedom of
expression: they would not know how to use it if it were granted to
them. That is why strictly structured tests of simple vocabulary,
pronunciation or sentence structure are not at variance with the
general principle of allowing the testee free scope. Incompatibility
only arises at an advanced level of L2 proficiency. At this level
requirements will be made of the L2 learner that come close to those
made of the L1 learner. What I am going to say about oral tests of Dutch
for 14 year old Dutch pupils might consequently be of interest to
those concerned with testing L2 proficiency.

First of all I shall say one or two things about the starting
points adopted for the construction of the tests. Then I shall
discuss the test formats and present the results of pretests
and rating try-outs.



2 Test content 

The oral tests for native speakers of Dutch developed so far
are part of a plan to compose a set of oral tests for the whole
of secondary education. Tests for intermediate levels have been
constructed first, because it is supposed that a series of such
tests would contribute to the development of programmed instruction
in the oral skill. That is to say, individual teachers as well
as writers of educational material could benefit from the existence
of a set of formative tests based on relevant educational objectives.
An inventory of educational objectives through an analysis of the most
frequently used textbooks showed a lack of any sort of systematic
[...].


The series of tests was set up as follows. 

1 Oral proficiency was defined as: the skill to express oneself
orally in an adequate manner in functional language situations.
The latter are considered to be language situations derived from
an analysis of requirements made of pupils both at school and
out of school. The main criterion for categorising these
situations is the number of speakers involved: monologue, dialogue
and polylogue. Each of the three main divisions is to comprise
about ten language situations in accordance with the subcategories
coming under the headings of monologue, dialogue and polylogue.

2 In selecting relevant subjects and situations we used the
inventory presented in 'Basics in speaking and listening for High
School Graduates' by Ronald E. Bassett et al. in Communication
Education [1978, 293 ff.], which is an outcome of the minimal
competencies movement in the USA. It lists all the situations
calling for oral speech in which the average (American) citizen
may get involved, situations that have to do with one's occupation,
with citizenship and maintenance (private life). With a few
adaptations to the Dutch situation, the list proved very helpful.

3 The tests' contents were largely determined by the relevance of
subskills, as appears from their featuring in textbooks. Such
subskills are:
reformulating, reasoning, distinguishing major matters from minor
ones, ranking facts chronologically, expressing an opinion and
(for the categories other than the monologue) asking and answering
questions [...] relating to the expression of emotions.

4 The choice of the assessment method to be used was based on
the following consideration. Since the tests are meant for classroom
use, enabling teachers to ascertain whether or not their pupils command
a particular component of oral proficiency after instruction or
without special training as the case may be, analytic assessment
on the basis of a number of criteria will be more in the interest
of the pupil and the teacher than global assessment. In this case
agreement among raters would seem to be less important than
consistency of the ratings over a period of time.

I shall now illustrate with the help of some examples how
these elements of test content were combined into tests.
These elements are:

- speech situation (number of speakers)

- subject matter/context

- subskill.

The first ('monologue') test puts the pupil in a situation in which
he has to [...]. He is shown a ten-minute film about [...]. After
[...] preparation he then has to [...]. He does get a list of a
number of obligatory points. Such a list proved an essential aid to
memory for a [...]. It is clear what we are after in this
assignment: the pupil has to give a report (of a film) in a situation
which might occur [...]; he has to demonstrate the subskill of
distinguishing between major and minor matters.



The second ('dialogue') test puts the pupil in a situation in
which he has to carry on a conversation, not the informal everyday
chat, but rather a more or less formal interview. The pupil is
expected to acquire information by asking questions in non-everyday
situations. For instance, he is supposed to be in need of information
about [...] a club that he is thinking of joining, or about a
particular organised trip that he might want to enter for. The pupil
gets a sketch of the situation with the assignment to prepare, within
a span of [...] minutes, the questions necessary to obtain the
required information. After this there is the interview with the
teacher, who is to supply the information asked for. On a number of
points, which the teacher selects himself, he gives only evasive
answers, requiring the pupil to keep on asking questions until he has
received a clear answer. The pupil must not let himself be put off.

The third test, a 'polylogue', comes under [...], bearing in mind
that the test is meant for 13 to 14 year olds. Such a subject is
'public transport'. [...]
3 Pre-tests and rating sessions

One of the constraints involved in the test situation is
[...] the question whether the behaviour that is elicited can
be assessed properly. To answer that question pretests and rating
sessions were held.

By means of the pretests we hoped to find out [...].

These sessions were set up like this.

REPORTED RATING SESSIONS

No.  RATERS  TEST                             USED      SCALE   GLOBAL
                                              CRITERIA  POINTS  RATING
1.   11      DIALOGUE*: ANSWERING QUESTIONS   11        4       yes
2.   11      DIALOGUE*: ANSWERING QUESTIONS   11        4       yes
3.   12      MONOLOGUE: REPORT OF FILM        4         4       yes

* IDENTICAL
In the rating sessions two formats were used: one with eleven
criteria and a four-point scale in the first two sessions, and one
with four criteria and a four-point scale in the third session.
The formats looked like this.



RATING MODEL SESSION 1 and 2 DIALOGUE

CONTENT
 1. brings up the obligatory points
 2. brings up points of his own choosing
 3. asks for necessary elucidation
 4. repeats himself unnecessarily
LANGUAGE
 5. words and sentences are suitable
 6. pronunciation and words are non-standard
DELIVERY
 7. speaks clearly
 8. speaks fluently
 9. speaks monotonously
COMMUNICATION
10. maintains contact with interlocutor
11. interrupts the interlocutor
GLOBAL RATING (marks from 1-10)

SCALE POINTS
YES   NO
RARELY   SOMETIMES   OFTEN   ALMOST ALWAYS
RATING MODEL SESSION 3 MONOLOGUE

CRITERIA           SCALE POINTS
                   WEAK   BARELY       FAIR   EXCELLENT
                          SUFFICIENT
1. CONTENT          0       0           0       0
2. ORGANISATION     0       0           0       0
3. LANGUAGE         0       0           0       0
4. DELIVERY         0       0           0       0
GLOBAL RATING

3.1 Consistency and reliability of raters

The first question I mentioned above concerned the reliability
and consistency of the raters. Consistency is, of course, an
aspect of reliability. The consistency of raters could be
[...]. As a teacher in the classroom
will not be able to enlist the help of other raters, it is more
important for raters to be consistent in their assessments than
for teachers to agree with each other: consistency is more
important than inter-rater agreement.




TABLE 1. COEFFICIENTS OF CONSISTENCY IN DIFFERENT RATING SESSIONS

RATER 1     .86
RATER 2     .85
RATER 3     .79
RATER 4     .84
RATER 5     .73
RATER 6     .73
RATER 7     .80
RATER 8     .73
RATER 9     .79
RATER 10    .74
RATER 11    .86
These correlations are relatively high, which means that the raters
are fairly consistent in their judgements.

In order to determine the raters' reliability - in the sense
of their homogeneity - a homogeneity analysis was used. (This
procedure assigns numbers to scale points for each rater. These
numbers are called scale values. By means of these scale values the
original data (the rating sheets filled out by the raters) can be
translated into a numerical data table. For each rater the
correlation between the perceived rating and the mean (= true)
rating can be computed. The square of that correlation is the
rater's reliability.

Homogeneity analysis determines the scale values for the scale points
in such a way that the mean rater reliability - or homogeneity -
is as high as possible.)
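
The last step of this procedure, the squared correlation between each rater's ratings and the mean rating, can be sketched as follows. This is only an illustration under simplifying assumptions: the scale values are fixed at 1 to 4 instead of being optimised by the homogeneity analysis, the ratings are invented, and the names `pearson` and `rater_reliabilities` are hypothetical helpers, not part of the original study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rater_reliabilities(ratings):
    """ratings[r][p] is the scale value rater r assigned to pupil p.

    Returns, per rater, the squared correlation between that rater's
    ratings and the mean rating over all raters.
    """
    n_raters = len(ratings)
    n_pupils = len(ratings[0])
    mean_rating = [
        sum(rater[p] for rater in ratings) / n_raters for p in range(n_pupils)
    ]
    return [pearson(rater, mean_rating) ** 2 for rater in ratings]


# Invented data: 3 raters, 5 pupils, fixed scale values 1..4 (assumption).
ratings = [
    [1, 2, 3, 4, 2],
    [1, 3, 3, 4, 2],
    [2, 2, 4, 3, 1],
]
reliabilities = rater_reliabilities(ratings)
mean_reliability = sum(reliabilities) / len(reliabilities)
print([round(r, 2) for r in reliabilities], round(mean_reliability, 2))
```

A full homogeneity analysis would in addition search for the scale values that make `mean_reliability` as high as possible; that optimisation is left out here.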

The raters' reliabilities for each of the three sessions were as
follows.



TABLE 2. RATER RELIABILITIES - DISCRIMINATION MEASURES

          RATING SESSION               RATING SESSION
RATER       1       2        RATER           3
  1        .80     .81         1            .43
  2        .74     .74         2            .67
  3        .62     .69         3            .38
  4        .71     .73         4            .58
  5         ?      .59         5            .41
  6        .50     .?2         6            .43
  7        .68     .67         7            .55
  8        .63     .64         8            .43
  9        .84     .76         9            .6?
 10        .63     .59        10            .51
 11        .73     .75        11            .70
                              12            .18

Mean       .68     .70                      .55



The difference in mean reliability between the first two
sessions and the third can perhaps be accounted for by the fact
that different rating formats were used: the first format used
eleven criteria and four scale points, the second format used four
criteria (and again four, but different, scale points). The four
criteria of the second format may be too complex to be applied
unambiguously.

High mean reliability does not mean, of course, that all raters
agreed in their assessments. The analysis only determines the
raters' homogeneity, that is, the extent to which their assessments
run parallel, or also the extent to which one rater's assessment can
be predicted on the basis of another rater's assessment. The raters
will only give more or less similar assessments if the scale
values that are assigned to the scale points for each rater resemble
each other per scale point, that is, if they have only little
variance. The scale values obtained in the three sessions can be
summarised as follows.

TABLE 3. SCALE VALUES. MEANS AND STANDARD DEVIATIONS.

RATING    SCALE POINT          MEAN SCALE   STANDARD
SESSION                        VALUE        DEVIATION
1         RARELY                 ?            ?
          SOMETIMES              .00          .19
          OFTEN                  ?            .19
          ALMOST ALWAYS         1.32          .15
2         RARELY                -.?4          ?
          SOMETIMES              .11          .18
          OFTEN                  .?7          .16
          ALMOST ALWAYS         1.24          .19
3         WEAK                 -1.21          .22
          BARELY SUFFICIENT     -.13          .14
          FAIR                   .12          .11
          EXCELLENT             1.32          .26



From the differences between the mean values it can be
concluded that the raters apparently used the same scale points,
which means that they agreed in their assessments. For if this were
not the case, if they had differed more or less randomly time and
again, the mean scale values would have been identical.



From the fact that, in all three sessions, there is a regular
increase of the mean values (even though the scale points used are
not the same) it is clear that the scale points have been interpreted
similarly to a certain extent, which is to say that, for example,
'often' has not systematically been regarded as superior to
'sometimes' in case of a particular criterion by some raters
whereas other raters dealt with this scale point the other way
round. The differences between the mean values clearly show when
reliability intervals are computed on the basis of the standard
deviations found. As far as the column of standard deviations is
concerned: the smaller the SD of the scale values assigned to a
scale point is, the more frequently this scale point has been used
by all raters at the same time. These standard deviations are low.
Those for 'weak' and 'excellent' are highest for the third session.
This is in accordance with the relatively low mean rater reliability
for this session, as was shown in TABLE 2: these two scale points
were apparently interpreted differently by different raters.
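
How such reliability intervals might be computed can be illustrated with the session 3 figures of TABLE 3. The sketch below is not the computation used in the study. It assumes that each standard deviation is taken over the twelve raters of session 3 and that an interval is the mean plus or minus 1.96 standard errors (a normal approximation); both choices are assumptions, as are the helper names `interval` and `overlaps`.

```python
from math import sqrt

# Session 3 mean scale values and standard deviations, taken from TABLE 3.
scale_points = {
    "WEAK": (-1.21, 0.22),
    "BARELY SUFFICIENT": (-0.13, 0.14),
    "FAIR": (0.12, 0.11),
    "EXCELLENT": (1.32, 0.26),
}
N_RATERS = 12  # raters in session 3 (TABLE 2)
Z = 1.96       # assumption: normal-approximation 95% interval


def interval(mean, sd, n=N_RATERS, z=Z):
    """Mean plus or minus z standard errors (assumed interval definition)."""
    half_width = z * sd / sqrt(n)
    return (mean - half_width, mean + half_width)


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = {name: interval(m, sd) for name, (m, sd) in scale_points.items()}
names = list(intervals)
adjacent_overlap = [
    overlaps(intervals[names[i]], intervals[names[i + 1]])
    for i in range(len(names) - 1)
]
for name, (low, high) in intervals.items():
    print(f"{name:18s} {low:6.2f} .. {high:6.2f}")
print(adjacent_overlap)  # [False, False, False] under these assumptions
```

Under these assumptions no two adjacent scale points have overlapping intervals. With wider intervals, for instance the mean plus or minus 1.96 raw standard deviations, 'barely sufficient' and 'fair' would overlap, so the outcome depends on the interval definition chosen.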

3.2 Independence of criteria

The second question regarded the independence of criteria. It
was expected that the criteria could not be applied entirely
independently, since they relate to one complex skill, but on the
other hand they were not supposed to be connected too closely, as
in that case fewer criteria or even one criterion would suffice.
And then the so-called analytic assessment would not have any
advantages over a global one. Analytic assessment has the advantage,
[...] of tests, that the pupils and their teacher can
learn from the results at what points the pupils' skills fall short
and call for additional training.

The homogeneity of the criteria has been determined by means
of the same scale analysis as was used for computing rater
reliability. For each of the three sessions a mean criterion was
determined. If all criteria correlated high with this mean criterion,
in fact one criterion would suffice. The next table shows the squared
correlations between the criteria and the mean criterion.
TABLE 4. DISCRIMINATION MEASURES CRITERIA

            RATING SESSION                 RATING SESSION
CRITERIA      1       2        CRITERIA          3
   1          ?       ?           1             .62
   2          ?       ?           2             .72
   3          ?       ?           3             .59
   4          ?       ?           4             .66
   5          ?       ?          MEAN           .65
   6          ?       ?
   7         .57     .67
   8         .57     .65
   9         .40     .57
  10         .57     .?0
  11         .02     .00

 MEAN        .33     .33





The mean correlations are low, particularly for the first two
sessions. This means that the criteria do not overlap completely.
This is mainly due to the criteria in the first category, that
of CONTENT. Together with criterion no. 11 - 'interrupts
interlocutor' - they are deviant. The four criteria used in the
third session, which are as a matter of fact the headings of the
four categories in the first format, all correlate more closely
with the mean criterion. (These correlations cannot be simply
compared to the others, because the criteria as well as the scale
points differed from those used in the first two sessions.)

In order to establish to what extent the categories themselves
are homogeneous, the correlations between the various criteria and
the mean criterion per category have been computed as well.
The next table shows a survey of the (squared) correlations.

TABLE 5. DISCRIMINATION MEASURES PER CATEGORY OF CRITERIA

CATEGORY         CRITERION   SESSION 1   SESSION 2
CONTENT              1          .32          ?
                     2          .26         .05
                     3          .47         .00
                     4          .24          ?
LANGUAGE             5          .78          ?
                     6          .78         .?3
DELIVERY             7          .71         .7?
                     8          .5?         .65
                     9          .58         .70
COMMUNICATION       10          .56         .57
                    11           ?          .17



[...]

TABLE 6. CORRELATIONS BETWEEN CATEGORIES

RATING SESSION 1                   RATING SESSION 2

cat 2   .25                        cat 2    ?
cat 3   .27    .60                 cat 3    ?      ?
cat 4   .05   -.01    .06          cat 4    ?      ?      ?
       cat 1  cat 2  cat 3                cat 1  cat 2  cat 3

- [...] category should be split up;
- reducing the number of criteria to no more than the category
  headings is unnecessary;
- the categories of LANGUAGE and DELIVERY in the first format are
  the ones that could be combined.
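
The last point can be checked directly against the session 1 figures of TABLE 6. The sketch below simply looks up the most closely correlated pair of categories. It assumes that cat 1 to cat 4 follow the order of the first rating format (CONTENT, LANGUAGE, DELIVERY, COMMUNICATION); that mapping is an assumption, not something the table states.

```python
# Session 1 correlations between categories, as reported in TABLE 6.
# Assumed mapping of category numbers to headings (order of the first format).
categories = ["CONTENT", "LANGUAGE", "DELIVERY", "COMMUNICATION"]
correlations = {
    (1, 2): 0.25,
    (1, 3): 0.27, (2, 3): 0.60,
    (1, 4): 0.05, (2, 4): -0.01, (3, 4): 0.06,
}

# The pair with the highest correlation is the best candidate for merging.
best_pair = max(correlations, key=correlations.get)
best_names = tuple(categories[i - 1] for i in best_pair)
shared_variance = correlations[best_pair] ** 2

print(best_names, round(shared_variance, 2))
```

Even for this pair the squared correlation is only .36, so the two categories share roughly a third of their variance; all other pairs share far less.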



3.3 Analytic and global rating



The last question that we asked ourselves regarded the
two methods of assessment: global and analytic. The
collected data do not point to a definitive conclusion. We
prefer analytic assessment, and this preference is not ruled
out by the results. Here are some figures.

1 The correlation between the marks obtained in the first
  two sessions is .69.

2 The correlations between the scores on the mean criterion and
  the global marks given by the raters are

  session 1: .78

  session 2: .55

  session 3: .89

3 The first principal components of session 1 and session 2 (as
  far as analytic assessment is concerned) show a correlation
  of .71, which can be regarded as a test-retest measure.

- given the close correlation between the two methods of
  assessment, global marks would have sufficed for a ranking of
  pupils in the case of almost all raters;

- for classroom use, when only one teacher can assess the
  achievements, the analytic assessment is to be preferred.
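
Whether global marks yield the same ranking of pupils as analytic scores can be examined with a rank correlation. The sketch below uses Spearman's coefficient, which is a choice made here for illustration and not necessarily the statistic used in the study. The marks are invented, and `average_ranks`, `pearson` and `spearman` are hypothetical helper names.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented marks for six pupils given by one rater.
global_marks = [6, 7, 5, 8, 4, 7]                  # global rating, scale 1-10
analytic_means = [2.6, 3.1, 2.2, 3.6, 1.9, 2.9]    # mean over the criteria
rho = spearman(global_marks, analytic_means)
print(round(rho, 3))
```

A coefficient close to 1 means that both methods put the pupils in nearly the same order, which is the sense in which global marks can suffice for ranking even though they give no diagnostic detail.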